{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "xfcSMZEIFA_l"
   },
   "source": [
    "# ML with TensorFlow Extended (TFX) -- Part 1\n",
    "The puprpose of this tutorial is to show how to do end-to-end ML with TFX libraries on Google Cloud Platform. This tutorial covers:\n",
    "1. Data analysis and schema generation with **TF Data Validation**.\n",
    "2. Data preprocessing with **TF Transform**.\n",
    "3. Model training with **TF Estimator**.\n",
    "4. Model evaluation with **TF Model Analysis**.\n",
    "\n",
    "This notebook has been tested in Jupyter on the Deep Learning VM.\n",
    "\n",
    "## Setup Cloud environment"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import tensorflow as tf\n",
    "import tensorflow_data_validation as tfdv\n",
    "\n",
    "print('TF version: {}'.format(tf.__version__))\n",
    "print('TFDV version: {}'.format(tfdv.__version__))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "PROJECT = 'cloud-training-demos'    # Replace with your PROJECT\n",
    "BUCKET = 'cloud-training-demos-ml'  # Replace with your BUCKET\n",
    "REGION = 'us-central1'              # Choose an available region for Cloud MLE\n",
    "\n",
    "import os\n",
    "\n",
    "os.environ['PROJECT'] = PROJECT\n",
    "os.environ['BUCKET'] = BUCKET\n",
    "os.environ['REGION'] = REGION"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "gcloud config set project $PROJECT\n",
    "gcloud config set compute/region $REGION\n",
    "\n",
    "## ensure we predict locally with our current Python environment\n",
    "gcloud config set ml_engine/local_python `which python`"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<img valign=\"middle\" src=\"images/tfx.jpeg\">"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "l9u699PmHJXU"
   },
   "source": [
    "### UCI Adult Dataset: https://archive.ics.uci.edu/ml/datasets/adult\n",
    "Predict whether income exceeds $50K/yr based on census data. Also known as \"Census Income\" dataset."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "DATA_DIR='gs://cloud-samples-data/ml-engine/census/data'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 51
    },
    "colab_type": "code",
    "id": "ksuSTsysHfZV",
    "outputId": "87adfbf0-be77-4d81-9162-5a2f9feffd90"
   },
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "TRAIN_DATA_FILE = os.path.join(DATA_DIR, 'adult.data.csv')\n",
    "EVAL_DATA_FILE = os.path.join(DATA_DIR, 'adult.test.csv')\n",
    "!gcloud storage ls --long $TRAIN_DATA_FILE\n",
    "!gcloud storage ls --long $EVAL_DATA_FILE"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "QzCKD9Q8Gzry"
   },
   "source": [
    "## 1. Data Analysis\n",
    "For data analysis, visualization, and schema generation, we use [TensorFlow Data Validation](https://www.tensorflow.org/tfx/guide/tfdv) to perform the following:\n",
    "1. **Analyze** the training data and produce **statistics**.\n",
    "2. Generate data **schema** from the produced statistics.\n",
    "3. **Configure** the schema.\n",
    "4. **Validate** the evaluation data against the schema.\n",
    "5. **Save** the schema for later use."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "_S9QVPv3JYqM"
   },
   "source": [
    "### 1.1 Compute and visualise statistics"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "TETRPJXGJdWC"
   },
   "outputs": [],
   "source": [
    "HEADER = ['age', 'workclass', 'fnlwgt', 'education', 'education_num',\n",
    "               'marital_status', 'occupation', 'relationship', 'race', 'gender',\n",
    "               'capital_gain', 'capital_loss', 'hours_per_week',\n",
    "               'native_country', 'income_bracket']\n",
    "\n",
    "TARGET_FEATURE_NAME = 'income_bracket'\n",
    "TARGET_LABELS = [' <=50K', ' >50K']\n",
    "WEIGHT_COLUMN_NAME = 'fnlwgt'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# This is a convenience function for CSV. We can write a Beam pipeline for other formats.\n",
    "# https://www.tensorflow.org/tfx/data_validation/api_docs/python/tfdv/generate_statistics_from_csv\n",
    "train_stats = tfdv.generate_statistics_from_csv(\n",
    "    data_location=TRAIN_DATA_FILE, \n",
    "    column_names=HEADER,\n",
    "    stats_options=tfdv.StatsOptions(\n",
    "        weight_feature=WEIGHT_COLUMN_NAME,\n",
    "        sample_rate=1.0\n",
    "    )\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 545
    },
    "colab_type": "code",
    "id": "baquhTcwJncI",
    "outputId": "7fd3e7ee-e25e-4930-afda-4f3fb6dc9271"
   },
   "outputs": [],
   "source": [
    "tfdv.visualize_statistics(train_stats)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "YBWsZpDGJrBw"
   },
   "source": [
    "### 1.2 Infer Schema"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 904
    },
    "colab_type": "code",
    "id": "TLjiTL0VJyWl",
    "outputId": "3f2ff4c8-3ffe-4954-8236-529e83606e20"
   },
   "outputs": [],
   "source": [
    "schema = tfdv.infer_schema(statistics=train_stats)\n",
    "tfdv.display_schema(schema=schema)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "vQ9U99cWKYG2"
   },
   "outputs": [],
   "source": [
    "print(tfdv.get_feature(schema, 'age'))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "E8eJ8S-6Kdrt"
   },
   "source": [
    "### 1.3 Configure Schema"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "Gbdf-zRXKiAp"
   },
   "outputs": [],
   "source": [
    "# Relax the minimum fraction of values that must come from the domain for feature occupation.\n",
    "occupation = tfdv.get_feature(schema, 'occupation')\n",
    "occupation.distribution_constraints.min_domain_mass = 0.9\n",
    "\n",
    "# Add new value to the domain of feature native_country, assuming that we start receiving this\n",
    "# we won't be able to make great predictions of course, because this country is not part of our\n",
    "# training data.\n",
    "native_country_domain = tfdv.get_domain(schema, 'native_country')\n",
    "native_country_domain.value.append('Egypt')\n",
    "\n",
    "# All features are by default in both TRAINING and SERVING environments.\n",
    "schema.default_environment.append('TRAINING')\n",
    "schema.default_environment.append('EVALUATION')\n",
    "schema.default_environment.append('SERVING')\n",
    "\n",
    "# Specify that the class feature is not in SERVING environment.\n",
    "tfdv.get_feature(schema, TARGET_FEATURE_NAME).not_in_environment.append('SERVING')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 904
    },
    "colab_type": "code",
    "id": "bi7oUXuYL6Ec",
    "outputId": "79ea8944-496c-413e-9773-82de17076eb9"
   },
   "outputs": [],
   "source": [
    "tfdv.display_schema(schema=schema)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "DUtiulVrNZLU"
   },
   "source": [
    "### 1.4 Validate evaluation data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 111
    },
    "colab_type": "code",
    "id": "95Lit8o4NYuf",
    "outputId": "22f24a8d-4dca-4452-c9bf-5a326d88e659"
   },
   "outputs": [],
   "source": [
    "eval_stats = tfdv.generate_statistics_from_csv(\n",
    "    EVAL_DATA_FILE, \n",
    "    column_names=HEADER,\n",
    "    stats_options=tfdv.StatsOptions(\n",
    "        weight_feature=WEIGHT_COLUMN_NAME)\n",
    ")\n",
    "\n",
    "eval_anomalies = tfdv.validate_statistics(eval_stats, schema, environment='EVALUATION')\n",
    "tfdv.display_anomalies(eval_anomalies)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "TEm6B1qnOzTL"
   },
   "source": [
    "### 1.5 Freeze the schema"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "RAW_SCHEMA_LOCATION = 'raw_schema.pbtxt'"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "iz-WzwYvOzhG"
   },
   "outputs": [],
   "source": [
    "from tensorflow.python.lib.io import file_io\n",
    "from google.protobuf import text_format\n",
    "\n",
    "tfdv.write_schema_text(schema, RAW_SCHEMA_LOCATION)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!cat {RAW_SCHEMA_LOCATION}"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "Gsi_Hsh89Cl7"
   },
   "source": [
    "## License"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "0fOWx1yI9Dyn"
   },
   "source": [
    "Copyright 2019 Google LLC\n",
    "\n",
    "Licensed under the Apache License, Version 2.0 (the \"License\");\n",
    "you may not use this file except in compliance with the License.\n",
    "You may obtain a copy of the License at http://www.apache.org/licenses/LICENSE-2.0.\n",
    "\n",
    "Unless required by applicable law or agreed to in writing, software\n",
    "distributed under the License is distributed on an \"AS IS\" BASIS,\n",
    "WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
    "See the License for the specific language governing permissions and\n",
    "limitations under the License.\n",
    "\n",
    "---\n",
    "**Disclaimer**: This is not an official Google product. The sample code provided for an educational purpose.\n",
    "\n",
    "---"
   ]
  }
 ],
 "metadata": {
  "colab": {
   "collapsed_sections": [],
   "name": "02-tfx_end_to_end",
   "provenance": [],
   "version": "0.3.2"
  },
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
